
Make list diff reuse _obj_diff results instead of making duplicate calls #74

Open

bfrobin446 wants to merge 1 commit into master
Conversation

bfrobin446

The backtracking pass of the list diff algorithm in `_list_diff_0` was calling `_obj_diff` on pairs of objects that had already been diffed in the forward pass earlier in the `_list_diff` method.

When I create an array of `_obj_diff` results in the forward pass and reuse that array in the backtracking pass, I see about a 5x speedup on JSON files that contain lists of a few dozen large objects.

The backtracking pass of the list diff algorithm in `_list_diff_0` was
calling `_obj_diff` on pairs of objects that had already been diffed in
the forward pass earlier in the `_list_diff` method. This commit saves
the `_obj_diff` results from the forward pass and reuses them in the
backtracking pass so we don't duplicate the recursive calls.
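In sketch form, the change amounts to saving the forward-pass results and indexing into them during backtracking (names like `forward_pass` and `obj_diff` below are placeholders for illustration; the real `_list_diff`/`_list_diff_0` carry more state):

```python
# Sketch only: forward_pass and obj_diff stand in for the library's
# internals; this is not its actual API.

def forward_pass(xs, ys, obj_diff):
    """Diff every (i, j) pair once and keep the results."""
    m, n = len(xs), len(ys)
    saved = [[None] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            saved[i][j] = obj_diff(xs[i], ys[j])
    return saved

# The backtracking pass then reads saved[row][col] instead of calling
# obj_diff(xs[row], ys[col]) a second time.
```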
@corytodd
Collaborator

I'm running some benchmarks to see how this affects performance, and it looks like this PR is noticeably slower, which is unexpected. Do you happen to have any benchmarks that you've run? I'd be interested in results for both memory usage and execution time.

I picked a random test that touches this code path, `test_long_arrays`, and bumped the size up to 10K. I see a very consistent 237 seconds for this PR versus a very consistent 161 seconds for the current master branch.

At first glance I suspect this is caused by preallocating the entire m x n 2D array, which requires touching every memory address in the array. The current algorithm is a bit lazier: it only calls `_obj_diff` while it still has both a row and a column, and since those can decrement separately there is an opportunity for fewer than m x n iterations.
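To illustrate the shape of that argument (a toy walk, not json_delta's actual code): each backtracking step decrements the row, the column, or both, so the walk makes at most m + n - 1 diff-style calls, whereas prefilling the whole table always makes m x n:

```python
# Toy illustration with assumed shapes: `step` decides which index to
# decrement, the way the DP backtracking would.

def count_backtrack_calls(m, n, step):
    """Walk from (m-1, n-1) toward (0, 0), counting one call per step."""
    row, col, calls = m - 1, n - 1, 0
    while row >= 0 and col >= 0:
        calls += 1                      # one obj_diff-style call per step
        direction = step(row, col)
        if direction == "both":
            row, col = row - 1, col - 1
        elif direction == "row":
            row -= 1
        else:
            col -= 1
    return calls

# Worst case is m + n - 1 steps, versus the m * n cells a prefilled
# table must touch.
print(count_backtrack_calls(50, 40, lambda r, c: "both"))  # 40
```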

@corytodd
Collaborator

Also, thanks for taking the time to make this PR @bfrobin446 :)

@bfrobin446
Author

I'd expect the difference to come down to the size and structure of the array entries. If the child objects are expensive to diff (perhaps especially if there are arrays at several different levels of nesting), the balance shifts in favor of avoiding duplicate work.

I can’t share the files I was profiling on (employer internal), but I’ll see if I can generate a set of files that shows the behavior I was seeing.
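Something along these lines (a hypothetical generator, not the files I profiled) should produce the shape I'm describing: a list of a few dozen objects with arrays nested several levels deep:

```python
# Hypothetical reproduction-file generator; sizes and key names are
# arbitrary assumptions.
import json
import random

def make_obj(depth=3, width=4):
    """Build a nested object with arrays at several levels."""
    if depth == 0:
        return random.randint(0, 1000)
    return {
        f"key{i}": [make_obj(depth - 1, width) for _ in range(width)]
        for i in range(width)
    }

with open("left.json", "w") as f:
    json.dump([make_obj() for _ in range(40)], f)
# A second file with a handful of these objects perturbed would serve
# as the other side of the diff.
```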

@bfrobin446
Author

Another option would be to replace the preallocated array with a dictionary that stores only the row/column combinations that the forward pass touched. Once I have a set of reproduction files that I can share, I'll see if that gets us a better compromise between performance in the normal case and performance in the pathological case.
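A minimal sketch of that compromise (assumed names, not the library's API): key a dict by index pair, since the list elements themselves (dicts, lists) aren't hashable:

```python
# Sketch only: obj_diff stands in for _obj_diff; diff_at is a
# hypothetical helper, not existing library code.

def make_cached_diff(xs, ys, obj_diff):
    """Wrap obj_diff so each (row, col) pair is diffed at most once."""
    cache = {}

    def diff_at(row, col):
        key = (row, col)
        if key not in cache:
            cache[key] = obj_diff(xs[row], ys[col])
        return cache[key]

    return diff_at

# The forward pass calls diff_at(i, j) only for the cells it needs; the
# backtracking pass calls it again and hits the cache, so no pair is
# diffed twice and untouched cells cost nothing.
```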
